INTRODUCTION

Testing the equality of two or more population means is a problem
that occurs in nearly all disciplines. The usual analysis of variance
F test for equality of means is based on three assumptions:

(1)  independent  random  samples  are  selected  from  the  populations, 

(2)  the  data  from  each  population  are  normally  distributed,  and 

(3)  the  population  variances  are  equal. 

This study focuses on violations of assumption (3), the equal
variance assumption. Various alternatives to the usual analysis of
variance F test that have appeared in the literature are
presented and their properties discussed. Recommendations are
made concerning the use of these tests.

For  an  experimenter  trying  to  choose  an  acceptable  alternative 
for  the  usual  analysis  of  variance  F  test  there  are  two  topics  of 
concern,  robustness  and  power.   Robustness  refers  to  the  ability  of 
the  test  statistic  to  hold  the  Type  I  error  rate  at  the  desired  level 
when  basic  assumptions  are  violated.   The  power  of  a  test  statistic  is 
its  ability  to  detect  differences  among  population  means  when  there 
are  differences  among  them. 

If  a  test  is  to  be  applied  in  a  range  of  circumstances  in  which 
assumptions  are  violated,  the  test  must  be  robust.   Lack  of  control  of 
Type  I  error  rates  will  make  a  test  unacceptable  to  the  applied 
researcher.   Among  tests  that  are  robust,  power  can  be  used  as  a 
criterion  for  choosing  among  them. 


One of the common beliefs about the unequal variances situation
is that if the treatments have equal sample sizes then the usual ANOVA
F test is satisfactory in some sense. The findings of this report
bring that conventional wisdom into question. An important point
for a researcher is that an alternative test should not be chosen
solely on the basis of robustness; the power of the test statistic
must be considered as well. The primary focus of this report is the
power of alternatives to the usual ANOVA F test when the equality of
variance assumption is violated.


ALTERNATIVE  METHODOLOGIES 


Let $x_{ij}$ be the $j$th observation in the $i$th group, where
$j = 1, \ldots, n_i$ and $i = 1, \ldots, k$. The $x_{ij}$'s are assumed to be
independent and normally distributed with expected values $\mu_i$ and
variances $\sigma_i^2$. The minimum variance unbiased estimates of $\mu_i$
and $\sigma_i^2$ are

$$\bar{x}_{i.} = \sum_{j=1}^{n_i} x_{ij} / n_i \qquad \text{and} \qquad s_i^2 = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 / (n_i - 1),$$

respectively.

Various  alternatives  to  the  usual  ANOVA  F  test  are  presented.   A 
numerical  example  is  given  to  illustrate  the  computation  involved  for 
each  test.   In  order  to  demonstrate  numerically  how  to  perform  the 
various  tests  the  data  set  in  Table  1  was  used  in  each  case. 

TABLE  1.   Data  for  Examples. 

1) THE ANALYSIS OF VARIANCE METHOD

The  usual  analysis  of  variance  F  test  for  testing  the  equality  of 
population  means  in  a  completely  randomized  design,  with  one-way 
classification,  is  given  by  the  following  equation, 

$$F = \frac{\sum_{i=1}^{k} n_i (\bar{x}_{i.} - \bar{x}_{..})^2 / (k - 1)}{\sum_{i=1}^{k} (n_i - 1) s_i^2 / (N - k)}$$

where $N = \sum_{i=1}^{k} n_i$ and $\bar{x}_{..} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} / N$.

When  all  population  means  and  variances  are  equal,  this  statistic 

follows  the  F  distribution  with  (k  -  1)  and  (N  -  k)  degrees  of  freedom 

respectively.   This  will  be  referred  to  as  the  ANOVA  F  statistic. 

An Example of ANOVA
The computations for the ANOVA F statistic are illustrated using
the data in TABLE 1:

$$N = 7 + 6 + 8 + 8 = 29, \qquad \bar{x}_{..} = 281/29 = 9.690$$

$$F = \frac{[7(4.571 - 9.69)^2 + 6(11.667 - 9.69)^2 + 8(8.625 - 9.69)^2 + 8(13.75 - 9.69)^2]/3}{[6(16.286) + 5(1.867) + 7(9.696) + 7(2.786)]/25} = 115.941/7.777 = 14.908.$$

The ANOVA F value of 14.908 would be compared to a critical value
denoted $F_{\alpha}(3, 25)$, where $\alpha$ is the probability of a Type I error
chosen beforehand, and 3 and 25 are the degrees of freedom of the
numerator and denominator, respectively. Since 14.908 exceeds
$F_{.05}(3, 25)$, the null hypothesis that all means are equal would be
rejected at the .05 level of significance.
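
The arithmetic above can be checked with a short script. The following is a minimal sketch, assuming only the group sizes, means and variances quoted in the worked example; the function name `anova_f` is illustrative and does not come from the original text.

```python
def anova_f(n, means, variances):
    """ANOVA F statistic computed from per-group summary statistics."""
    k = len(n)
    N = sum(n)
    # Grand mean is the sample-size-weighted mean of the group means.
    grand_mean = sum(ni * mi for ni, mi in zip(n, means)) / N
    # Between-groups mean square.
    between = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(n, means)) / (k - 1)
    # Within-groups (pooled) mean square.
    within = sum((ni - 1) * vi for ni, vi in zip(n, variances)) / (N - k)
    return between / within, k - 1, N - k


# Summary statistics as quoted in the worked example.
n = [7, 6, 8, 8]
means = [4.571, 11.667, 8.625, 13.75]
variances = [16.286, 1.867, 9.696, 2.786]

F, df1, df2 = anova_f(n, means, variances)
print(F, df1, df2)
```

The result should agree with the hand computation of 14.908 up to rounding of the summary statistics.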

2)  THE  METHOD  OF  BOX 

Box (1954) proposed a procedure that requires the computation of
the ANOVA F statistic but adjusts the critical value and the degrees
of freedom to account for the unequal variances. Box proved that a
bias coefficient, b, determines the direction of the discrepancy
between the actual Type I error rate and the
nominal Type I error rate. The b coefficient is approximately the
ratio of the unweighted and weighted means of the population
variances. When the group sample sizes ($n_i$'s) are equal, the weighted
and unweighted mean variances are equal, hence $b = 1.0$. When the
group sample sizes are unequal, unless the variances are homogeneous,
b may be either greater than or less than 1.0. When $b \neq 1.0$, the
ANOVA F statistic will be biased. Box showed that the mean square
ratio of the ANOVA F statistic is approximately distributed as
$bF(h', h)$, where $h'$ and $h$ represent reduced degrees of freedom and F
represents an F random variable. Although Box defines b, h' and h
from the population variances, it is possible to substitute the
estimated variances from the sample data in the following equations:


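$$b = \frac{N - k}{N(k - 1)} \cdot \frac{\sum_{i=1}^{k} (N - n_i) s_i^2}{\sum_{i=1}^{k} (n_i - 1) s_i^2}$$

$$h' = \frac{\left[\sum_{i=1}^{k} (N - n_i) s_i^2\right]^2}{\left(\sum_{i=1}^{k} n_i s_i^2\right)^2 + N \sum_{i=1}^{k} (N - 2n_i) s_i^4}$$

$$h = \frac{\left[\sum_{i=1}^{k} (n_i - 1) s_i^2\right]^2}{\sum_{i=1}^{k} (n_i - 1) s_i^4}$$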


An Example of Box's Method

The computations for Box's method are illustrated using the data in
TABLE 1:

$$b = \frac{25}{29(3)} \cdot \frac{22(16.286) + 23(1.867) + 21(9.696) + 21(2.786)}{6(16.286) + 5(1.867) + 7(9.696) + 7(2.786)} = \frac{25}{29(3)} \cdot \frac{663.355}{194.425} = 0.980$$

$$h' = \frac{[22(16.286) + 23(1.867) + 21(9.696) + 21(2.786)]^2}{[7(16.286) + 6(1.867) + 8(9.696) + 8(2.786)]^2 + 29[15(16.286)^2 + 17(1.867)^2 + 13(9.696)^2 + 13(2.786)^2]} = \frac{(663.355)^2}{(225.06)^2 + 29(5360.828)} = 2.135$$

$$h = \frac{[6(16.286) + 5(1.867) + 7(9.696) + 7(2.786)]^2}{6(16.286)^2 + 5(1.867)^2 + 7(9.696)^2 + 7(2.786)^2} = \frac{(194.425)^2}{2321.251} = 16.285$$

$$b F_{.05}(h', h) = b F_{.05}(2.135, 16.285) = .980(3.552) = 3.481.$$

Recall that the ANOVA F = 14.908. Box's method yields a critical value of
3.481, so the null hypothesis would be rejected. Most computer
packages can evaluate the F distribution with fractional degrees of
freedom. When only tables are available, a conservative critical
value can be obtained by checking the critical values corresponding
to the integer degrees of freedom on either side of the fractional degrees
of freedom and choosing the largest value. For this example 2 and 16
degrees of freedom would be the conservative degrees of freedom.
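
As a check on the three quantities in Box's method, the formulas for b, h' and h can be evaluated directly. This is a minimal sketch, assuming the summary statistics quoted in the worked example; the function name `box_coefficients` is hypothetical. The critical value itself still has to come from an F table or a statistical package and is not computed here.

```python
def box_coefficients(n, variances):
    """Bias coefficient b and reduced degrees of freedom h', h for Box's method."""
    k = len(n)
    N = sum(n)
    # Sum of (N - n_i) s_i^2, used in both b and h'.
    a = sum((N - ni) * vi for ni, vi in zip(n, variances))
    # Sum of (n_i - 1) s_i^2, the pooled within-groups sum of squares.
    w = sum((ni - 1) * vi for ni, vi in zip(n, variances))
    b = (N - k) / (N * (k - 1)) * a / w
    h_prime = a ** 2 / (
        sum(ni * vi for ni, vi in zip(n, variances)) ** 2
        + N * sum((N - 2 * ni) * vi ** 2 for ni, vi in zip(n, variances))
    )
    h = w ** 2 / sum((ni - 1) * vi ** 2 for ni, vi in zip(n, variances))
    return b, h_prime, h


# Summary statistics as quoted in the worked example.
n = [7, 6, 8, 8]
variances = [16.286, 1.867, 9.696, 2.786]

b, h_prime, h = box_coefficients(n, variances)
print(b, h_prime, h)
```

The values should match the hand-computed 0.980, 2.135 and 16.285 up to rounding.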

It should be noted that in the equal sample size case, after some
algebraic manipulation, the critical value reduces to
$F_{\alpha}[(k - 1)\epsilon', (N - k)\epsilon]$, where $\epsilon'$ and $\epsilon$, the factors by which
the degrees of freedom are reduced, are given by

$$\epsilon' = \frac{1}{1 + \frac{k - 2}{k - 1} c^2}, \qquad \epsilon = \frac{1}{1 + c^2}$$

and $c$ is the coefficient of variation of the variances. That is to
say,

$$c^2 = \frac{1}{k} \sum_{i=1}^{k} \frac{(\sigma_i^2 - \bar{\sigma}^2)^2}{(\bar{\sigma}^2)^2}, \qquad \text{where} \qquad \bar{\sigma}^2 = \frac{1}{k} \sum_{i=1}^{k} \sigma_i^2.$$

Because the population variances ($\sigma_i^2$'s) are unknown, use the estimated
variances ($s_i^2$'s) to estimate the parameters.
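
To illustrate how strongly the degrees of freedom shrink in the equal sample size case, the reduction factors can be computed for a given set of variances. The sketch below is illustrative only: the function name `box_equal_n_factors` is an assumption, and the variance set (1, 1, 1, 13) is borrowed from the variance conditions mentioned later in the literature review.

```python
def box_equal_n_factors(variances):
    """Squared coefficient of variation c^2 of the variances and Box's factors."""
    k = len(variances)
    mean_var = sum(variances) / k
    # c^2 = (1/k) * sum of squared relative deviations from the mean variance.
    c2 = sum((v - mean_var) ** 2 for v in variances) / (k * mean_var ** 2)
    eps_prime = 1 / (1 + (k - 2) / (k - 1) * c2)  # numerator df factor
    eps = 1 / (1 + c2)                            # denominator df factor
    return c2, eps_prime, eps


# One extreme variance among four groups.
c2, eps_prime, eps = box_equal_n_factors([1, 1, 1, 13])
print(c2, eps_prime, eps)

# Homogeneous variances leave the degrees of freedom unchanged.
print(box_equal_n_factors([4, 4, 4, 4]))
```

With homogeneous variances both factors equal one, so the usual ANOVA degrees of freedom are recovered.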


3)  THE  METHOD  OF  BROWN  &  FORSYTHE 

Brown & Forsythe (1974) suggest a statistic in which the
numerator is the between-groups sum of squares used in the ANOVA F
statistic. The difference is in the denominator, whose expectation
equals that of the numerator when all means are equal. That is,

$$F^* = \frac{\sum_{i=1}^{k} n_i (\bar{x}_{i.} - \bar{x}_{..})^2}{\sum_{i=1}^{k} (1 - n_i/N) s_i^2}$$

Critical values are obtained from the F distribution with (k - 1) and
f degrees of freedom, respectively, where

$$\frac{1}{f} = \sum_{i=1}^{k} c_i^2 / (n_i - 1) \qquad \text{and} \qquad c_i = \frac{(1 - n_i/N) s_i^2}{\sum_{j=1}^{k} (1 - n_j/N) s_j^2}.$$

Brown & Forsythe used the Satterthwaite (1941) approximation for f.
When there are only two groups, Brown & Forsythe's statistic reduces to
what is known as the Welch (1936) approximate degrees of freedom
solution to the Behrens-Fisher problem. Although Scheffé (1944)
proved that exact solutions of this type cannot be found, a simulation
study by Wang (1971) has shown that this approach controls the size
of the test adequately for the k = 2 case.

An Example of Brown & Forsythe's Method
The computations for Brown and Forsythe's method are illustrated
using the data in TABLE 1:

$$F^* = \frac{7(4.571 - 9.69)^2 + 6(11.667 - 9.69)^2 + 8(8.625 - 9.69)^2 + 8(13.75 - 9.69)^2}{(1 - 7/29)16.286 + (1 - 6/29)1.867 + (1 - 8/29)9.696 + (1 - 8/29)2.786} = 347.823/22.874 = 15.206$$

$$c_1 = (1 - 7/29)16.286/22.874 = 0.540$$

$$c_2 = (1 - 6/29)1.867/22.874 = 0.065$$

$$c_3 = (1 - 8/29)9.696/22.874 = 0.307$$

$$c_4 = (1 - 8/29)2.786/22.874 = 0.088$$

$$1/f = .540^2/6 + .065^2/5 + .307^2/7 + .088^2/7 = 0.064, \qquad f = 15.625.$$

Since Brown & Forsythe's method yields $F^* = 15.206$, compared to
$F_{.05}(3, 15.625) = 3.256$, the null hypothesis would be rejected.
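
The Brown & Forsythe computation can be sketched in code as follows. This is a minimal illustration, assuming the summary statistics quoted in the worked example; the function name `brown_forsythe` is hypothetical. Because the hand computation rounds 1/f to 0.064 before inverting, the unrounded f differs slightly from 15.625.

```python
def brown_forsythe(n, means, variances):
    """Brown-Forsythe F* with its Satterthwaite denominator degrees of freedom."""
    k = len(n)
    N = sum(n)
    grand_mean = sum(ni * mi for ni, mi in zip(n, means)) / N
    # Numerator: between-groups sum of squares (not divided by k - 1).
    numerator = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(n, means))
    # Denominator terms (1 - n_i/N) s_i^2.
    terms = [(1 - ni / N) * vi for ni, vi in zip(n, variances)]
    denominator = sum(terms)
    f_star = numerator / denominator
    # c_i are the relative contributions of each group to the denominator.
    c = [t / denominator for t in terms]
    f = 1 / sum(ci ** 2 / (ni - 1) for ci, ni in zip(c, n))
    return f_star, k - 1, f


# Summary statistics as quoted in the worked example.
n = [7, 6, 8, 8]
means = [4.571, 11.667, 8.625, 13.75]
variances = [16.286, 1.867, 9.696, 2.786]

f_star, df1, f = brown_forsythe(n, means, variances)
print(f_star, df1, f)
```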


4)  THE  METHOD  OF  WELCH 

Welch (1951) suggested the following statistic,

$$v^2 = \frac{\sum_{i=1}^{k} w_i (\bar{x}_{i.} - \tilde{x}_{..})^2 / (k - 1)}{1 + (2/3)(k - 2)\Lambda}$$

where

$$w_i = n_i / s_i^2, \qquad w = \sum_{i=1}^{k} w_i, \qquad \tilde{x}_{..} = \sum_{i=1}^{k} w_i \bar{x}_{i.} / w, \qquad \Lambda = \frac{3 \sum_{i=1}^{k} (1 - w_i/w)^2 / (n_i - 1)}{k^2 - 1}.$$

The numerator of Welch's statistic differs from the ANOVA F numerator
in that it weights the overall mean and the deviations from
it by $w_i$ rather than $n_i$. The critical values may be obtained from an
F distribution with (k - 1) and $1/\Lambda$ degrees of freedom,
respectively. When there are only two groups, Welch's statistic also
reduces to what is known as the Welch approximate degrees of freedom
solution to the Behrens-Fisher problem.

An Example of Welch's Method

The computations for Welch's method are illustrated using the data
in TABLE 1:

$$w_1 = 7/16.286 = 0.430, \quad w_2 = 6/1.867 = 3.214, \quad w_3 = 8/9.696 = 0.825, \quad w_4 = 8/2.786 = 2.872$$

$$w = .43 + 3.214 + .825 + 2.872 = 7.341$$

$$\Lambda = 3[(1 - .43/7.341)^2/6 + (1 - 3.214/7.341)^2/5 + (1 - .825/7.341)^2/7 + (1 - 2.872/7.341)^2/7]/15 = 3(.376)/15 = 0.075$$

$$1/\Lambda = 1/.075 = 13.333$$

$$\tilde{x}_{..} = [.43(4.571) + 3.214(11.667) + .825(8.625) + 2.872(13.75)]/7.341 = 86.069/7.341 = 11.724$$

$$v^2 = \frac{[.43(4.571 - 11.724)^2 + 3.214(11.667 - 11.724)^2 + .825(8.625 - 11.724)^2 + 2.872(13.75 - 11.724)^2]/3}{1 + (2/3)2(.075)} = (41.723/3)/1.100 = 12.644.$$

Since Welch's method yields $v^2 = 12.644$, compared to $F_{.05}(3, 13.333)
= 3.387$, the null hypothesis would be rejected.
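
Welch's statistic and its approximate denominator degrees of freedom can be computed as in the sketch below, which assumes the summary statistics quoted in the worked example; the function name `welch_test` is illustrative. The hand computation rounds the intermediate quantity to 0.075, so unrounded results differ slightly in the second decimal place.

```python
def welch_test(n, means, variances):
    """Welch's v^2 statistic with degrees of freedom (k - 1, 1/Lambda)."""
    k = len(n)
    w = [ni / vi for ni, vi in zip(n, variances)]  # w_i = n_i / s_i^2
    w_sum = sum(w)
    # Variance-weighted grand mean.
    x_tilde = sum(wi * mi for wi, mi in zip(w, means)) / w_sum
    lam = 3 * sum((1 - wi / w_sum) ** 2 / (ni - 1) for wi, ni in zip(w, n)) / (k ** 2 - 1)
    numerator = sum(wi * (mi - x_tilde) ** 2 for wi, mi in zip(w, means)) / (k - 1)
    v2 = numerator / (1 + (2 / 3) * (k - 2) * lam)
    return v2, k - 1, 1 / lam


# Summary statistics as quoted in the worked example.
n = [7, 6, 8, 8]
means = [4.571, 11.667, 8.625, 13.75]
variances = [16.286, 1.867, 9.696, 2.786]

v2, df1, df2 = welch_test(n, means, variances)
print(v2, df1, df2)
```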

Levy (1978) proposed that the non-null distribution of Welch's
statistic can be approximated by a non-central F distribution with
parameters (k - 1), $f$, and $\lambda$, where

$$f = (k^2 - 1)/(3\Lambda), \qquad \Lambda = \sum_{i=1}^{k} \frac{1}{n_i - 1} (1 - w_i/w)^2,$$

$$w_i = n_i / \sigma_i^2, \qquad w = \sum_{i=1}^{k} w_i,$$

$$\tilde{\mu} = \sum_{i=1}^{k} w_i \mu_i / w \qquad \text{and} \qquad \lambda = \sum_{i=1}^{k} w_i (\mu_i - \tilde{\mu})^2.$$

Here $\Lambda$ denotes the sum alone, so that $f$ has the same form as the
denominator degrees of freedom of Welch's test with $\sigma_i^2$ in place
of $s_i^2$. Monte Carlo techniques were used to demonstrate that this
approximation is reasonable. Thus, as is the case with an ANOVA, one
could determine appropriate sample sizes for achieving a desired level
of power associated with Welch's test or, for specific sample sizes,
one could determine the power of Welch's test for particular
alternatives to the null hypothesis.

5)  THE  METHOD  OF  JAMES 

James (1951) found a test statistic similar to Welch's which
differs primarily in its approximation for the critical values. The
test statistic proposed by James is simply the numerator of Welch's
statistic and may be written as

$$J = \sum_{i=1}^{k} w_i (\bar{x}_{i.} - \tilde{x}_{..})^2 / (k - 1)$$

where $w_i = n_i / s_i^2$, $w = \sum_{i=1}^{k} w_i$ and $\tilde{x}_{..} = \sum_{i=1}^{k} w_i \bar{x}_{i.} / w$.

The critical value is $\chi^2_{\alpha} h(\alpha)$, where $\chi^2_{\alpha}$ is the $(1 - \alpha)$ percentile of
the chi-square distribution based on (k - 1) degrees of freedom and

$$h(\alpha) = 1 + \frac{3\chi^2_{\alpha} + (k + 1)}{2(k^2 - 1)} \sum_{i=1}^{k} (1 - w_i/w)^2 / (n_i - 1).$$

Just like Brown & Forsythe's and Welch's statistics, James' statistic
also reduces to what is known as the Welch approximate degrees of
freedom solution to the Behrens-Fisher problem in the two sample case.

An Example of James' Method
The computations for James' method are illustrated using
the data in TABLE 1:

$$w_1 = 7/16.286 = 0.430, \quad w_2 = 6/1.867 = 3.214, \quad w_3 = 8/9.696 = 0.825, \quad w_4 = 8/2.786 = 2.872$$

$$w = .43 + 3.214 + .825 + 2.872 = 7.341$$

$$\tilde{x}_{..} = [.43(4.571) + 3.214(11.667) + .825(8.625) + 2.872(13.75)]/7.341 = 86.069/7.341 = 11.724$$

$$J = [.43(4.571 - 11.724)^2 + 3.214(11.667 - 11.724)^2 + .825(8.625 - 11.724)^2 + 2.872(13.75 - 11.724)^2]/3 = 41.723/3 = 13.908.$$

The $\alpha = 0.05$ chi-square critical value based on 3 degrees of freedom
is 7.815, so

$$h(\alpha) = 1 + \frac{3(7.815) + 5}{2(15)} [(1 - .43/7.341)^2/6 + (1 - 3.214/7.341)^2/5 + (1 - .825/7.341)^2/7 + (1 - 2.872/7.341)^2/7] = 1 + (28.445/30)(0.376) = 1.357$$

$$\chi^2_{.05} h(.05) = 7.815(1.357) = 10.605.$$

Since James' method yields a chi-square value of 13.908, compared to a
critical value of 10.605, the null hypothesis would be rejected.
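
The first order James computation can be sketched in the same way. The sketch below follows the form of the statistic as written in this section, including the division by (k - 1), and takes the chi-square critical value as an input (7.815 for 3 degrees of freedom at the .05 level, as quoted above) rather than computing it. The function name `james_first_order` is hypothetical.

```python
def james_first_order(n, means, variances, chi2_crit):
    """James statistic (as written in the text) and its adjusted critical value."""
    k = len(n)
    w = [ni / vi for ni, vi in zip(n, variances)]  # w_i = n_i / s_i^2
    w_sum = sum(w)
    x_tilde = sum(wi * mi for wi, mi in zip(w, means)) / w_sum
    J = sum(wi * (mi - x_tilde) ** 2 for wi, mi in zip(w, means)) / (k - 1)
    spread = sum((1 - wi / w_sum) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    # First order correction factor h(alpha).
    h_alpha = 1 + (3 * chi2_crit + (k + 1)) / (2 * (k ** 2 - 1)) * spread
    return J, chi2_crit * h_alpha


# Summary statistics as quoted in the worked example.
n = [7, 6, 8, 8]
means = [4.571, 11.667, 8.625, 13.75]
variances = [16.286, 1.867, 9.696, 2.786]

J, critical = james_first_order(n, means, variances, chi2_crit=7.815)
print(J, critical, J > critical)
```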

OTHER  METHODS 

Several techniques exist that were not chosen for a more detailed
investigation and comparison, either because they were too complicated
to be feasibly used by the typical researcher or because little was found on
the power behavior of the test statistic.

The Second Order Method of James
One such technique is known as the second order method of James.
This method includes a second order approximation term that is added to
the correction factor $h(\alpha)$ for the critical value. For k > 2 James
proposes to use the first order method for smaller samples or the
usual chi-square critical value for large samples. In his opinion,
including the second order correction term would involve too much
numerical calculation for the small gain in precision it provides.
It should be noted that computers in 1951 were far less capable than they are
today. Thus, if James' second order correction were implemented in a
statistical package so that hand calculation would not have to be
done, the second order method could give slightly better
approximations than the first order method.

The Method of Unweighted Means
The method of unweighted means is another technique that has been
widely used in recent years. However, Milliken & Johnson (1984) do
not recommend its use when the variances are unequal. The test
statistic is given by

$$F = \frac{\tilde{n} \sum_{i=1}^{k} (\bar{x}_{i.} - \bar{x}^{*}_{..})^2 / (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 / (N - k)}$$

where $1/\tilde{n} = (1/k) \sum_{i=1}^{k} (1/n_i)$ and $\bar{x}^{*}_{..} = \sum_{i=1}^{k} \bar{x}_{i.} / k$.

The quantity $\tilde{n}$ is the harmonic mean of the sample sizes. The critical
values may be obtained from the F distribution with (k - 1) and
(N - k) degrees of freedom, respectively. This analysis yields
reasonable approximations to the F distribution only when the sample
sizes are not too unequal. A theoretical analysis suggests that the
size of this technique will be even more affected by heterogeneous
variances (when the sample sizes are unequal) than the usual ANOVA F
statistic (Kohr & Games 1974). For this reason the method of
unweighted means was not considered in the detailed comparisons. It
should be noted that when the sample sizes are equal this analysis is
identical to ANOVA.
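
The unweighted means statistic can also be computed from summary statistics, since the within-groups sum of squares equals the sum of (n_i - 1) times each group variance. The following is a minimal sketch under that identity, using the summary statistics from the worked examples; the function name `unweighted_means_f` is an assumption made for illustration.

```python
def unweighted_means_f(n, means, variances):
    """Unweighted means F statistic and the harmonic mean of the sample sizes."""
    k = len(n)
    N = sum(n)
    # Harmonic mean of the sample sizes.
    n_harm = k / sum(1 / ni for ni in n)
    # Unweighted mean of the group means.
    mean_of_means = sum(means) / k
    between = n_harm * sum((m - mean_of_means) ** 2 for m in means) / (k - 1)
    within = sum((ni - 1) * vi for ni, vi in zip(n, variances)) / (N - k)
    return between / within, n_harm


# Summary statistics as quoted in the worked examples.
n = [7, 6, 8, 8]
means = [4.571, 11.667, 8.625, 13.75]
variances = [16.286, 1.867, 9.696, 2.786]

F_um, n_harm = unweighted_means_f(n, means, variances)
print(F_um, n_harm)
```

With equal sample sizes the harmonic mean equals the common sample size and the unweighted mean of the group means equals the grand mean, so the function returns the ordinary ANOVA F.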

The  Method  of  Two  Stage  Sampling 
Bishop & Dudewicz (1978) present procedures, with tables and
approximations needed for implementation, which give exact tests with
power and size completely independent of the unknown variances. As a
historical note, two-stage sampling procedures were first introduced
by Stein (1945) in an equal variance context. The procedure of Bishop
& Dudewicz guarantees that the probability of a Type I error is
exactly $\alpha$, and the power is exactly $(1 - \beta)$ for a given value of $\delta$
(where $\delta = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2$) chosen by the researcher.

The primary purpose of the first stage of the procedure is to
obtain estimates of the k variances based on n observations randomly
chosen from each treatment group. Once the sample variances are
computed, it is possible to determine $N_i$, the total number of
observations needed from the $i$th group, so that the desired power will
be obtained. The second stage consists of sampling the additional
$(N_i - n)$ observations that are required for the $i$th group and then
testing the null hypothesis $\mu_1 = \mu_2 = \cdots = \mu_k$.

A practical problem with this method is the requirement of equal
sample sizes in the first stage. Wilcox (1987) proposes a
simple yet accurate method for handling unequal sample sizes in the
first stage of the Bishop & Dudewicz method. If obtaining additional
observations is impractical, the procedure by Wilcox might still be
useful, since it can be used to determine whether the existing sample
sizes are large enough to obtain the desired power. This
method was not considered because of its complexity, the requirement
of obtaining additional samples, and the lack of literature which
would allow comparison to the other procedures under consideration.
On the other hand, if the researcher does have the luxury of obtaining
additional samples for each treatment, then this method is
attractive for its ability to control exactly both the Type I error rate
and the power.

REVIEW OF LITERATURE

In  this  section  five  papers  are  reviewed  that  compare  the 
performance  of  the  test  statistics  described  in  the  Alternative 
Methodologies  section.   The  tests  are  appraised  in  terms  of  the  Type  I 
error  rates  and  the  power  under  various  combinations  of  sample  sizes, 
variances  and  alternative  hypotheses.   The  discussion  of  each  paper  is 
divided  into  four  sections:  Purpose  and  Method,  Size,  Power  and 
Comments . 

M.  A.  Brown  and  A.  B.  Forsythe  (1974) 
Purpose  and  Method 
Brown and Forsythe compared the performance of their test
statistic with the test of Welch, the first order test of James and
the ANOVA F statistic. The size of each test was studied using four,
six and ten groups. The sample sizes ranged from four
to twenty-one and the standard deviations ranged from one to three.
For the power study, four groups with sample sizes (11,16,16,21) were
simulated with equal and unequal variances and four different mean
structures. For each set of conditions, 10,000 independent replications
were simulated.

Size 

The ANOVA F statistic shows considerable deviations from its
nominal size when the sample sizes of the groups are unequal. At the
5% level in the examples shown, the empirical size of the ANOVA F
varies from 3% when the larger sample sizes are paired with the larger
variances to 17% when the smaller sample sizes are paired with the
larger variances. For small sample sizes, the test of James deviates
more widely from the nominal size, rejecting the null hypothesis a
little too often. Overall, the Type I error rate of the Brown-Forsythe
test varies slightly more than that of the test of Welch. For groups with
more than ten experimental units, the differences between the nominal
and empirical sizes of both the Brown-Forsythe and Welch tests are
small, with Welch's test remaining slightly closer to the nominal
value in most cases. Results of the size investigation are shown in
TABLE 2.

Power 
The power results from this study are given in TABLE 3. The
ANOVA F values were given only when the equal variance assumption was
met. Because the Welch test and the test of James have similar
numerators and Welch's test had better control of the Type I error
rate, the power calculations for James' test were omitted. When the
variances were equal, both the Welch test and the Brown-Forsythe test
had only slightly less power than the ANOVA F. The Brown-Forsythe
test showed higher power than the Welch test, by around 10%, only when
an extreme mean was paired with the largest variance. In all other
cases the Welch test had superior power. For example, when an extreme
mean was paired with the smallest variance the gain in power was as
high as 35%. When extreme means coincided with the largest and smallest
variances, as much as a 26% gain in power was obtained by using Welch's
test.

Comments
The  combination  of  sample  sizes  and  standard  deviations 
adequately  demonstrated  the  size  of  the  various  test  statistics  in 
different  situations.   However,  the  power  study  was  limited  to  include 
only  one  sample  size  combination.   Other  combinations  would  have  been 
helpful  had  they  been  investigated.   Also,  the  power  of  the  ANOVA  F 
would  have  been  interesting  to  see  even  though  the  Type  I  error  rate 
was  not  close  to  the  nominal  level. 

TABLE 2.  Empirical Type I Error Probabilities, Nominal Size = .05.

Sample Size                      Standard Deviation             ANOVA    Brown-
Condition                        Condition                      F        Forsythe Welch    James

(4,4,4,4)                        (1,1,1,1)                      .049     .034     .045     .079
(4,4,4,4)                        (1,2,2,3)                      .067     .041     .047     .081
(4,8,10,12)                      (1,1,1,1)                      .051     .048     .057     .067
(4,8,10,12)                      (1,2,2,3)                      .030     .057     .049     .056
(4,8,10,12)                      (3,2,2,1)                      .144     .062     .065     .077
(11,11,11,11)                    (1,1,1,1)                      .051     .049     .051     .055
(11,11,11,11)                    (1,2,2,3)                      .063     .057     .050     .054
(11,16,16,21)                    (1,1,1,1)                      .049     .051     .050     .053
(11,16,16,21)                    (3,2,2,1)                      .108     .062     .055     .058
(11,16,16,21)                    (1,2,2,3)                      .040     .065     .054     .056

(4,4,4,4,4,4)                    (1,1,1,1,1,1)                  .049     .034     .061     .095
(4,4,4,4,4,4)                    (1,1,2,2,3,3)                  .083     .049     .070     .109
(4,6,6,8,10,12)                  (1,1,1,1,1,1)                  .046     .045     .061     .074
(4,6,6,8,10,12)                  (1,1,2,2,3,3)                  .031     .065     .062     .075
(4,6,6,8,10,12)                  (3,3,2,2,1,1)                  .170     .058     .068     .084
(6,6,6,6,6,6)                    (1,1,2,2,3,3)                  .071     .052     .057     .073
(11,11,11,11,11,11)              (1,1,2,2,3,3)                  .073     .065     .057     .062
(16,16,16,16,16,16)              (1,1,2,2,3,3)                  .072     .068     .051     .052
(21,21,21,21,21,21)              (1,1,2,2,3,3)                  .069     .065     .048     .049

(20,20,20,20,20,20,20,20,20,20)  (1,1,1.5,1.5,2,2,2.5,2.5,3,3)  .071     .066     .052     .053

TABLE 3.  Empirical Power of the Tests, Nominal Size = .05,
Sample Size Condition for all Cases (11,16,16,21).

Variance      Mean                         Brown-
Condition     Structure          ANOVA F   Forsythe  Welch

(1,1,1,1)     (0,0,0,0)          .049      .051      .050
              (1,0,0,0)          .686      .676      .650
              (0,0,0,0.7)        .553      .544      .523
              (0.5,0,0,0.5)      .336      .333      .318

(3,2,2,1)     (0,0,0,0)                    .062      .055
              (1.5,0,0,0)                  .332      .222
              (0,0,0,1)                    .227      .478
              (1.3,0,0,1.3)                .424      .682

(1,2,2,3)     (0,0,0,0)                    .065      .054
              (1.3,0,0,0)                  .291      .655
              (0,0,0,1)                    .273      .175
              (1,0,0,1)                    .298      .400

R. L. Kohr and P. A. Games (1974)
Purpose  and  Method 

Kohr and Games studied the size and power of the ANOVA F, the Welch
test and the test by Box. The study of the sizes of the various tests
was done with four groups. The sample sizes ranged from six to
fourteen and the variances ranged from one to thirteen, with nine
different sets of variance combinations. The power study included
several graphs of the power of the tests at several
different levels of the noncentrality parameter. Three different mean
structures were investigated with the same variance and sample size
conditions that were included in the size study. For each set of
conditions, four blocks of 500 replications were used in the
simulation.

Size 

Ranking the procedures from worst to best, TABLE 4 shows that the ANOVA
F had the poorest control of the size, followed by Box's test, with
Welch's test the best. For the equal sample size case, when there
was one extreme variance the ANOVA F performed particularly poorly,
with the size of the test double the desired level. Box's test
performed similarly with one extreme variance but was not as liberal
as the ANOVA F. When the groups had several different variances, Box's
test was somewhat conservative. Welch's test demonstrated by far the
best control of the Type I error rate, with empirical probabilities
ranging from 5% to 5.3%.

For the unequal sample size cases the ANOVA F tended to be quite
conservative when the extreme variance was paired with the largest
sample size and was quite liberal when the extreme variance was paired
with the smallest sample size. The minimum and maximum empirical sizes
for the ANOVA F were 2.2% and 23.6%, respectively. Box's test and
Welch's test both performed better than the ANOVA F. Overall
the test by Welch was slightly better in the unequal sample size cases
because of its smaller range of empirical values, 4% to 5.3%, as
compared to the range of Box's test, 3.6% to 7.6%.

Power 
Power functions are displayed in TABLES 5, 6, 8, and 9. These
values were read from graphs. The power functions were expressed in
terms of the noncentrality parameter,

$$\phi = \left[ \frac{n \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2 / k}{\bar{\sigma}^2} \right]^{1/2}, \qquad \text{where} \qquad \bar{\mu} = \sum_{i=1}^{k} \mu_i / k.$$

Since all variance conditions used had an average of four and all
sample size conditions had a harmonic mean of eight, these numbers were
used for $\bar{\sigma}^2$ and $n$, respectively, in the above formula.
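
As a small illustration of the noncentrality parameter, the sketch below evaluates it for a given mean structure, using the harmonic mean sample size of eight and the average variance of four stated above. The function name `noncentrality_phi` and the example means (0, 0, 0, 2) are hypothetical and are not taken from Kohr and Games.

```python
import math


def noncentrality_phi(mu, n_harmonic, mean_variance):
    """Noncentrality parameter: sqrt(n * sum((mu_i - mu_bar)^2) / k / sigma_bar^2)."""
    k = len(mu)
    mu_bar = sum(mu) / k
    spread = sum((m - mu_bar) ** 2 for m in mu) / k
    return math.sqrt(n_harmonic * spread / mean_variance)


# Hypothetical mean structure with one deviant mean.
phi = noncentrality_phi([0, 0, 0, 2], n_harmonic=8, mean_variance=4)
print(phi)

# Equal means give a noncentrality parameter of zero (the null hypothesis).
print(noncentrality_phi([1, 1, 1, 1], n_harmonic=8, mean_variance=4))
```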

When the assumption of equal variances was met, the ANOVA F
statistic was more powerful than either the Welch test or the Box test,
but the increase in power was small, ranging from 2.5% to 5.5%.
TABLE 6 shows that when the means are evenly spaced and
the sample sizes are equal, Welch's test was the most powerful,
with Box's test having a severe power loss. The gain in power for the
conditions described above was as high as 47%. The power values for less
extreme variance conditions are intermediate between those shown in
TABLE 6, but the same trends hold whenever the means are evenly spaced
and the sample sizes are equal. Unfortunately, when the null
hypothesis is violated in other ways, the power depends upon which
variances accompany the deviant means. As demonstrated previously by
Brown and Forsythe, TABLE 5 shows that when small variances are paired
with the more deviant means, Welch's test was as much as 34% more
powerful than Box's test and as much as 24% more powerful than the
ANOVA F statistic. When larger variances are paired with the more
deviant means, the ANOVA F statistic was the most powerful, while Box's
test had slightly higher power than Welch's test, as shown in TABLE 5.
Caution must be used in this situation because the superiority of the
ANOVA F may be inflated: its lack of control means that the probability
of a Type I error may be twice as high as desired. As TABLE 5 shows, in
some situations Box's test does have more power than Welch's test, but
this gain in power is at most 11%, whereas when the reverse conditions
hold Welch's test provides gains in power as high as 34%. In none of the
equal sample size cases does Box's test have superior power over both
the ANOVA F and Welch's test. Thus, as Kohr and Games stated, "The only
absolute statement that can be made about the equal sample size case
when population variances are unknown is that the Box test would not be
the preferred test."

When sample sizes are unequal and the variances are unequal, the
ANOVA F statistic's failure to control the probability of a Type I
error makes its use questionable. When the bias coefficient of Box's
test is less than one (b < 1.0), see TABLE 7, and the ANOVA F is
conservative, it might still be possible that the ANOVA F would
overcome the initial conservative bias and be the most powerful
alternative for medium to large values of the noncentrality parameter.
If the experimenter is willing to accept a 5% risk of a Type I error,
they should have little complaint if the actual size is 1% and the
test is still the most powerful of any available. It is only when
(b > 1.0) and the ANOVA F is biased in the liberal direction that the
ANOVA F must be discarded.

When  the  larger  variances  were  paired  with  the  larger  sample 
sizes  (b  <  1.0),  if  the  alternative  hypothesis  had  means  that  were 
equally  spaced  apart  or  deviant  means  paired  with  smaller  variances 
and  smaller  sample  sizes,  then  Welch's  test  was  consistently  the  most 
powerful  of  the  three  tests.   Within  each  alternative  hypothesis  the 
superiority  of  the  Welch  test  roughly  decreased  as  the  bias 
coefficient  (b)  approached  1.0. 

Under the alternative where ($\mu_1 < \mu_2 = \mu_3 < \mu_4$) the results were
mixed and not as clear. When the variances were (1,1,1,13) the
results were similar to those above with Welch's test being the most
powerful, but when the variances were (1,2,3,10) or (1,3,5,7) the Box
test was the most powerful by as much as 18% as shown in TABLE 8. To
be assured of using the most powerful test the experimenter must know
whether (b > 1) or (b < 1) and the kind of alternative hypothesis that
is expected. But this is a most unlikely situation for the use of an
omnibus test like the three discussed in this paper. If in fact the
experimenter anticipates on an a priori basis that the alternative
hypothesis would be ($\mu_1 < \mu_2 = \mu_3 < \mu_4$), usually they would perform
appropriate contrasts to gain greater power rather than using an
omnibus test.

When larger population variances are paired with the smaller
sample sizes (b > 1.0) results as shown in TABLE 9 apply. The complex
results for this variance and sample size condition occurred when the
alternative hypothesis was ($\mu_1 < \mu_2 = \mu_3 < \mu_4$). For the equally
spaced means alternative and when the extreme means were paired with
the larger sample sizes and smaller variances the Welch test was
consistently more powerful than the Box test. However, for the
($\mu_1 < \mu_2 = \mu_3 < \mu_4$) alternative, the Box test was more powerful than
Welch's test for the variance condition of (7,5,3,1) but less powerful
for the (13,1,1,1) variance condition as shown in TABLE 9.

Overall  the  Welch  test  demonstrated  superb  control  of  the  Type  I 
error  rate  and  usually  had  power  superior  to  Box's  test.   The  only 
times  that  the  Box  test  demonstrated  greater  power  were  on  the  few 
occasions  when  two  means  deviated  largely  from  the  grand  mean  and  both 
of  the  means  were  paired  with  relatively  large  variances.   As  noted  by 
Kohr  and  Games,  many  unequal  variance  conditions  produced  results 
where  the  power  superiority  of  the  Welch  test  was  even  greater  than 
the 47% shown in TABLE 9, while the power superiority of the Box test
over Welch's test never exceeded 13%.

Kohr  and  Games  suggest  that  one  could  make  use  of  Box's  bias 
coefficient  (b)  as  follows.   If  the  bias  coefficient  is  between  0.88 
and  1.05  then  the  ANOVA  F  would  be  used  and  if  the  value  was  outside 
this range then the Welch test would be used. The authors also
suggested another way to use Box's test in an omnibus procedure, but
felt  that  such  a  procedure  would  be  too  complex  for  the  everyday 
experimenter.   The  authors  conclude  that  the  Welch  test  is  the  test  of 
choice  when  the  sample  sizes  are  unequal,  but  it  is  not  better  than 
the  ANOVA  F  in  the  equal  sample  size  case. 
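The selection rule suggested by Kohr and Games can be sketched in a few lines. The formula for b below is an assumption, since this section does not state it: it is the ratio of the expected between-groups mean square to the expected within-groups mean square when all population means are equal, and it reproduces the values listed in TABLE 7. The function names are hypothetical. In practice the population variances are unknown and would have to be estimated.

```python
def box_bias(sample_sizes, variances):
    """Bias coefficient b of the ANOVA F under unequal variances.

    Assumed form (not given in the text): E[MS between] / E[MS within]
    under the null hypothesis of equal means.
    """
    n_total = sum(sample_sizes)
    k = len(sample_sizes)
    between = sum((n_total - n) * v for n, v in zip(sample_sizes, variances))
    within = sum((n - 1) * v for n, v in zip(sample_sizes, variances))
    return (n_total - k) * between / (n_total * (k - 1) * within)


def suggested_test(sample_sizes, variances, low=0.88, high=1.05):
    """Kohr and Games' rule: ANOVA F if low <= b <= high, otherwise Welch."""
    b = box_bias(sample_sizes, variances)
    return "ANOVA F" if low <= b <= high else "Welch"


# Two conditions from TABLE 7.
print(round(box_bias([6, 8, 9, 9], [1, 1, 1, 13]), 3), suggested_test([6, 8, 9, 9], [1, 1, 1, 13]))
print(round(box_bias([5, 7, 10, 14], [13, 1, 1, 1]), 3), suggested_test([5, 7, 10, 14], [13, 1, 1, 1]))
```

Note that with equal sample sizes this form of b equals 1 for any set of variances, so the rule always selects the ANOVA F in the equal sample size case. That is the recommendation questioned in the Comments below.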

Comments 
This  paper  by  Kohr  and  Games  was  the  only  paper  found  that 
demonstrated  the  performance  of  both  Box's  test  and  Welch's  test  under 
numerous sample size and variance conditions. The size study was
conducted with nine different variance conditions and showed just how
poorly the ANOVA F controls the size when the variances are (1,1,1,13)
or (13,1,1,1) and the sample sizes are equal. There is some doubt
about the authors' recommendation to use the ANOVA F when the sample
sizes are equal. Although the ANOVA F did have superior power, its
inability to control the Type I error rate makes its use questionable.
With  the  above  variance  condition  the  empirical  Type  I  error  rate 
could  be  twice  as  high  as  desired  and  some  researchers  may  find  this 
objectionable.   The  increase  in  power  may  not  be  worth  the  risk  that 
is  involved.   The  findings  of  this  paper  about  the  performance  of 
Welch's  test  are  consistent  with  the  previous  paper.   New  findings 
included  the  comparisons  of  Box's  test  with  Welch's  test.   For  an 
omnibus  test  to  be  run  on  any  particular  set  of  data  where  the 
assumption  of  equal  variances  is  violated  the  Welch  test  should  be 
performed because of its excellent control of the size. There is
little doubt that in some situations this test will not be the most
powerful, but on these rare occasions the power loss would be from 1%
to 25%.


TABLE 4.  Empirical Type I Error Probabilities, Nominal Size = .05.

Sample Size     Variance
Condition       Condition            ANOVA F    Box     Welch

(8,8,8,8)       (1,1,1,13)            .115      .070    .063
                (1,2,3,10)            .074      .050    .050
                (1,3,5,7)             .065      .041    .056
                (2.67,4,4,5.33)       .051      .044    .050
                (4,4,4,4)             .052      .045    .048
                (5.33,4,4,2.67)       .063      .053    .059
                (7,5,3,1)             .060      .045    .054
                (10,3,2,1)            .073      .044    .053
                (13,1,1,1)            .099      .055    .053

(6,8,9,9)       (1,1,1,13)            .074      .048    .050
                (1,2,3,10)            .056      .052    .057
                (1,3,5,7)             .057      .050    .054
                (2.67,4,4,5.33)       .049      .045    .057
                (4,4,4,4)             .052      .046    .055
                (5.33,4,4,2.67)       .058      .045    .050
                (7,5,3,1)             .081      .049    .059
                (10,3,2,1)            .114      .067    .060
                (13,1,1,1)            .160      .067    .053

(5,7,10,14)     (1,1,1,13)            .030      .053    .055
                (1,2,3,10)            .022      .036    .040
                (1,3,5,7)             .033      .047    .053
                (2.67,4,4,5.33)       .033      .045    .059
                (4,4,4,4)             .043      .043    .046
                (5.33,4,4,2.67)       .074      .050    .051
                (7,5,3,1)             .134      .059    .050
                (10,3,2,1)            .173      .075    .063
                (13,1,1,1)            .236      .076    .055



TABLE 5.  Estimated Empirical Power of the Tests, Nominal Size = .05,
Sample Size Condition (8,8,8,8), Mean Structure ($\mu_1 < \mu_2 = \mu_3 < \mu_4$).

                                  Variance Conditions
Noncentrality            (7,5,3,1)                   (1,3,5,7)
Parameter         ANOVA F    Box    Welch     ANOVA F    Box    Welch

0.0                .075     .038    .050       .069     .025    .031
0.6                .138     .088    .213       .131     .113    .088
1.0                .313     .225    .531       .306     .250    .200
1.3                .538     .438    .775       .488     .425    .325
1.6                .756     .656    .931       .681     .581    .469
2.0                .950     .831    .999       .856     .788    .700

TABLE 6.  Estimated Empirical Power of the Tests, Nominal Size = .05,
Sample Size Condition (8,8,8,8), Mean Structure ($\mu_1 < \mu_2 < \ldots$).

                                  Variance Conditions
Noncentrality     (1,1,1,13) or (13,1,1,1)           (4,4,4,4)
Parameter         ANOVA F    Box    Welch     ANOVA F    Box    Welch

0.0                .119     .088    .063       .088     .031    .050
0.6                .075     .100    .200       .144     .113    .125
1.0                .319     .200    .500       .344     .300    .300
1.3                .463     .300    .738       .513     .481    .481
1.6                .631     .425    .894       .688     .663    .644
2.0                .825     .625    .999       .900     .875    .850

TABLE 7.  Variance Conditions of the Study.

Coefficient                              Bias (b) Values for
of Variation of     Variance             Sample Size Conditions
the Variances       Condition            (6,8,9,9)    (5,7,10,14)

1.299               (1,1,1,13)            0.875         0.586
0.884               (1,2,3,10)            0.884         0.663
0.559               (1,3,5,7)             0.894         0.754
0.235               (2.67,4,4,5.33)       0.956         0.889
0.0                 (4,4,4,4)             1.0           1.0
0.235               (5.33,4,4,2.67)       1.048         1.134
0.559               (7,5,3,1)             1.127         1.397
0.884               (10,3,2,1)            1.231         1.568
1.299               (13,1,1,1)            1.352         1.778
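The first column of TABLE 7 can be reproduced with a short calculation. This is a sketch under one assumption that the text does not state: the coefficient of variation uses the population (divide by k) standard deviation of the four variances. The function name `variance_cv` is illustrative.

```python
from statistics import mean, pstdev


def variance_cv(variances):
    """Coefficient of variation of a set of population variances.

    Assumption: population standard deviation (divisor k), chosen because
    it reproduces the values printed in TABLE 7.
    """
    return pstdev(variances) / mean(variances)


for condition in ([1, 1, 1, 13], [1, 2, 3, 10], [1, 3, 5, 7],
                  [2.67, 4, 4, 5.33], [4, 4, 4, 4]):
    print(condition, round(variance_cv(condition), 3))
```

Reversing the order of the variances, as in (13,1,1,1), leaves the coefficient unchanged, which is why the column is symmetric about the (4,4,4,4) row.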
TABLE 8.  Estimated Empirical Power of the Tests, Nominal Size = .05,
Sample Size Condition (5,7,10,14), Mean Structure ($\mu_1 < \mu_2 = \mu_3 < \mu_4$).

Noncentrality       Variance Condition (1,3,5,7)
Parameter         ANOVA F    Box    Welch

0.0                .019     .038    .062
0.6                .100     .163    .125
1.0                .319     .400    .300
1.3                .525     .619    .463
1.6                .725     .800    .625
2.0                .913     .931    .813

TABLE 9.  Estimated Empirical Power of the Tests, Nominal Size = .05,
Sample Size Condition (6,8,9,9), Mean Structure ($\mu_1 < \mu_2 = \mu_3 < \mu_4$).

                       Variance Conditions
Noncentrality       (7,5,3,1)          (13,1,1,1)
Parameter          Box    Welch       Box    Welch

0.0               .031    .075       .075    .056
0.6               .150    .113       .125    .225
1.0               .225    .200       .181    .419
1.3               .375    .306       .275    .638
1.6               .500    .425       .350    .825
2.0               .700    .600       .538    .981
J. B. Dijkstra and P. S. P. J. Werter (1981)
Purpose and Method
Dijkstra and Werter compared the size and power of three tests:
the second order test of James, the Welch test and the Brown-Forsythe
test. Although James' second order method was ruled out as an
alternative to the ANOVA F statistic in the Alternative Methodologies
section, its values of size and power were listed to demonstrate its
similarities to Welch's test. The size study included three groups,
four groups and six groups. The sample sizes ranged from four to
twenty and the standard deviations ranged from one to three. The
power study included two sets of four groups, one with equal sample
sizes and the other with unequal sample sizes. Three different
alternatives were used with the same standard deviations as the size
study. For each set of criteria 10,000 replications were simulated.

Size 

The range of the size of all three test statistics for the
various combinations was from 3.5% to 7.5%, so all three have
excellent control of the size as demonstrated in TABLE 10. The second
order test of James was the test statistic that remained closest to
the nominal value in nearly all cases, whereas Welch's test and the
Brown-Forsythe test behaved very similarly but both were slightly
higher than the nominal value.

Power 

None of the tests is uniformly more powerful than the other two.
The test of Welch and the second order James' test are almost
identical, differing only in the third decimal place. As was found
earlier by Brown and Forsythe, when an extreme mean was paired with
the largest variance the Brown-Forsythe test had superior power by as
much  as  24%.   Conversely,  when  the  extreme  mean  was  paired  with  the 
smallest  variance  Welch's  test  and  the  second  order  test  of  James  had 
superior  power  by  as  much  as  32%.   When  the  alternative  included  two 
extreme  means  (5,0,0,0.5)  or  (0.5,0,0,5)  paired  with  the  extreme 
variances  (1,2,2,3)  the  test  with  superior  power  was  dictated  by 
whether  the  largest  extreme  mean  coincided  with  the  largest  variance 
or  not.   Thus  for  this  alternative  the  behavior  was  exactly  the  same 
as  above  when  there  was  only  one  extreme  mean,  but  with  equal  sample 
sizes  the  difference  in  power  was  around  21%  and  with  unequal  sample 
sizes  the  difference  was  only  as  high  as  6%.   Results  of  the  power 
study  are  displayed  in  TABLE  11. 

Comments 
The  sample  sizes  chosen  for  the  size  study  were  adequate  but  a 
few  more  variance  combinations  would  have  been  helpful  for  both  the 
size  study  and  the  power  study.   A  variance  combination  with  just  one 
extreme variance (e.g. (1,1,1,3)) would have added to the utility of
this investigation. Some cases of the noncentrality parameter were
chosen poorly (e.g. (5,0,0,0.5)). It would have been better to choose
an alternative like (1,0,0,0.5) or (0.5,0,0,0.5). The detection of a
difference  in  the  means  is  more  difficult  when  the  means  are  closely 
grouped.   Thus  if  a  test  has  good  power  when  there  is  only  a  small 
difference  in  the  means  then  it  would  follow  logically  that  the  power 
would  be  even  higher  when  the  means  differ  by  a  large  amount.   For  the 
above  alternative  the  power  of  all  three  tests  is  so  high  that  it  is 
difficult to assess which test is superior. However, the other
alternatives do show more clearly the power behavior of the tests. It
might have been helpful to include the size and power of the ANOVA F
along with the other tests as well. Because the behavior of the Welch
test and the second order test of James is nearly identical, the
simpler test by Welch would be the test of choice, given the
complexity of calculating James' second order test. The findings of
this paper are consistent with the findings of the previous papers.
New findings include the similarity of Welch's test and James' test
and the power behavior under the (5,0,0,0.5) and (0.5,0,0,5)
alternatives.
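To illustrate how little computation Welch's test requires, the sketch below evaluates the statistic and its degrees of freedom from group summary statistics. It assumes the usual form of the Welch (1951) statistic, which is not restated in this section; the function name and the example data are hypothetical. The p-value would be taken from an F distribution with (df1, df2) degrees of freedom and is not computed here.

```python
def welch_statistic(sample_sizes, means, variances):
    """Welch's heteroscedastic one-way statistic from summary statistics.

    Assumed form: weights w_i = n_i / s_i^2, with the result referred to
    an F distribution on (k - 1, df2) degrees of freedom.
    """
    k = len(sample_sizes)
    weights = [n / v for n, v in zip(sample_sizes, variances)]
    total_weight = sum(weights)
    # Variance-weighted grand mean.
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    between = sum(w * (m - weighted_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1)
              for w, n in zip(weights, sample_sizes))
    statistic = between / (1 + 2 * (k - 2) * lam / (k * k - 1))
    df1 = k - 1
    df2 = (k * k - 1) / (3 * lam)
    return statistic, df1, df2


# Hypothetical summary data: three groups of five, unit sample variances.
w_stat, w_df1, w_df2 = welch_statistic([5, 5, 5], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0])
print(round(w_stat, 3), w_df1, round(w_df2, 3))
```

James' second order test needs a considerably longer correction to the critical value, which is the complexity referred to above.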
TABLE 10.  Empirical Type I Error Probabilities, Nominal Size = .05.

                       Standard
Sample Size            Deviation          Brown-
Condition              Condition          Forsythe    James    Welch

(4,4,4)                (1,1,1)             .037       .044     .041
                       (1,2,3)             .048       .053     .049

(4,6,8)                (1,1,1)             .044       .050     .049
                       (1,2,3)             .050       .043     .042

(4,4,4,4)              (1,1,1,1)           .040       .052     .050
                       (1,2,2,3)           .044       .057     .056

(4,6,8,10)             (1,1,1,1)           .044       .049     .052
                       (1,2,2,3)           .052       .044     .045
                       (3,2,2,1)           .059       .054     .059

(10,10,10,10)          (1,1,1,1)           .048       .050     .051
                       (1,2,2,3)           .056       .050     .051

(10,14,16,20)          (1,1,1,1)           .046       .046     .046
                       (1,1.5,2,3)         .061       .050     .050
                       (3,2,1.5,1)         .062       .046     .048

(4,4,4,4,4,4)          (1,1,1,1,1,1)       .035       .053     .062
                       (1,1,2,2,3,3)       .046       .064     .075

(4,6,8,10,12,14)       (1,1,1,1,1,1)       .043       .053     .062
                       (1,1,2,2,3,3)       .065       .051     .056
                       (3,3,2,2,1,1)       .057       .058     .069

(10,10,10,10,10,10)    (1,1,1,1,1,1)       .048       .050     .052
                       (1,1,2,2,3,3)       .065       .051     .053

(10,10,15,15,20,20)    (1,1,2,2,3,3)       .068       .051     .053
                       (3,3,2,2,1,1)       .063       .053     .056
TABLE 11.  Empirical Power of the Tests, Nominal Size = .05.

               Standard
Sample Size    Deviation     Mean             Brown-
Condition      Condition     Structure        Forsythe    James    Welch

(4,4,4,4)      (1,1,1,1)     (0,0,0,0)         .035       .053     .052
                             (5,0,0,0.5)      1.000       .999     .999
                             (3,0,0,0)         .951       .875     .874

(4,4,4,4)      (1,2,2,3)     (0,0,0,0)         .046       .057     .057
                             (3,0,0,0)         .312       .616     .616
                             (0,0,0,3)         .306       .220     .220
                             (5,0,0,0.5)       .751       .974     .972
                             (0.5,0,0,5)       .660       .449     .447

(4,6,8,10)     (1,1,1,1)     (0,0,0,0)         .047       .052     .055
                             (3,0,0,0)         .986       .935     .941
                             (0,0,0,3)        1.000      1.000    1.000

(4,6,8,10)     (1,2,2,3)     (0,0,0,0)         .057       .050     .051
                             (3,0,0,0)         .556       .873     .876
                             (0,0,0,3)         .746       .521     .524
                             (5,0,0,0.5)       .983       .999     .999
                             (0.5,0,0,5)       .999       .923     .926
R. R. Wilcox, V. L. Charlin and K. L. Thompson (1986)
Purpose and Method
The authors of this paper compare the performance, in terms of
size and power, of the ANOVA F statistic, Welch's test and the
Brown-Forsythe test. The size study included four groups and six groups
with sample sizes ranging from four to fifty. The standard deviations
ranged from one to four and standard deviation conditions included one
extreme standard deviation, equally spaced standard deviations, as
well as other combinations. The power study included the same sample
size and standard deviation combinations with one alternative
hypothesis. For each set of criteria 10,000 replications were
simulated.

Size 
With  as  many  as  fifty  observations  per  group  and  equal  sample 
sizes  in  four  groups  the  ANOVA  F  can  be  very  unsatisfactory  when  there 
is  one  extreme  variance.   Comparing  the  equal  sample  size  cases 
(11,11,11,11),  (21,21,21,21)  and  (50,50,50,50)  with  one  extreme 
standard  deviation  (1,1,1,4)  it  becomes  apparent  that  the  robustness 
of  the  ANOVA  F  improves  very  slowly  as  the  sample  sizes  increase,  and 
it  is  not  obvious  when,  if  ever,  the  sample  sizes  would  be  large 
enough  to  indicate  that  the  ANOVA  F  would  be  acceptable  in  terms  of 
its size. As TABLE 12 shows, the empirical probability of a Type I
error starts at 11% and only reduces to 9% when the groups have fifty
observations. For equal sample sizes Welch's test performs better
than the Brown-Forsythe test in the sense that the maximum
empirical Type I error probability for the Brown-Forsythe test was
8.4% while the maximum value for Welch's test was only 6.0%.
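A size study of this kind can be sketched with a small Monte Carlo simulation. The example below is only an illustration, not the authors' program: it uses 5,000 replications rather than 10,000, a fixed seed, and an approximate tabled critical value of 2.84 for the upper 5% point of F(3, 40). All names are hypothetical.

```python
import random


def anova_f(groups):
    """Usual one-way ANOVA F statistic for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def empirical_size(sample_sizes, std_devs, critical_value, replications=5000, seed=1):
    """Proportion of rejections when all population means are equal (zero)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        groups = [[rng.gauss(0.0, sd) for _ in range(n)]
                  for n, sd in zip(sample_sizes, std_devs)]
        if anova_f(groups) > critical_value:
            rejections += 1
    return rejections / replications


# Approximate upper 5% point of F(3, 40) from standard tables (an assumption).
F_CRIT_3_40 = 2.84

rate_equal = empirical_size([11, 11, 11, 11], [1, 1, 1, 1], F_CRIT_3_40)
rate_unequal = empirical_size([11, 11, 11, 11], [1, 1, 1, 4], F_CRIT_3_40)
print("equal standard deviations:  ", rate_equal)
print("one extreme standard deviation:", rate_unequal)
```

With equal standard deviations the estimated rate should sit near the nominal 5%, while the (1,1,1,4) condition should give a rate of roughly twice that, in line with the (11,11,11,11) rows of TABLE 12.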

However,  when  the  four  groups  had  unequal  sample  sizes  the  choice 
between  the  Brown-Forsythe  test  and  Welch's  test  is  not  as  clear-cut. 
There  are  instances  where  one  will  have  a  slight  advantage  over  the 
other, but both tests have Type I error rates between 4.4% and 8.6%.
When  the  larger  standard  deviations  were  paired  with  the  larger  sample 
sizes  Welch's  test  remained  closer  to  the  nominal  value.   Conversely, 
when  the  larger  standard  deviations  were  paired  with  the  smaller 
sample  sizes  the  Brown-Forsythe  test  remained  closer  to  the  nominal 
value.   The  Type  I  error  rate  for  the  ANOVA  F  ranged  from  2.7%  to 
27.9%,  being  very  conservative  when  the  larger  standard  deviations 
coincided  with  the  larger  sample  sizes  and  quite  liberal  when  larger 
standard  deviations  coincided  with  the  smaller  sample  sizes. 

With six groups of equal sample size Welch's test had a slight
edge over the Brown-Forsythe test. When the group sizes were unequal
the behavior of all three tests was very similar to the four group
situations. Overall, the edge would be given to the Welch test
because of its smaller maximum empirical rate of 8.6% as compared to
the maximum empirical rate of 10% for the Brown-Forsythe test.

Power 

Even  when  the  variances  are  equal  there  is  only  a  slight  loss  of 
power,  around  1-2%,  when  using  Welch's  test  or  the  Brown-Forsythe  test 
as  compared  to  using  the  ANOVA  F.   As  several  authors  have  noted 
previously,  when  extreme  means  are  paired  with  small  standard 
deviations Welch's test has superior power by as much as 69%, as shown
in TABLE 13. When extreme means are paired with larger standard
deviations the Brown-Forsythe test has superior power but only by as
much as 26%. In the cases where the Welch test does not have superior
power the ANOVA F has about 20-25% higher power than Welch's test, but
for these exact cases the ANOVA F has particularly poor control over
the size so that the increase in power may be due to the inflated Type
I error rate.

Comments 
The combinations of sample sizes and standard deviations were
adequate to show the size behavior of the various tests. The findings
of this paper are consistent with those previously reviewed. This
paper uncovered a condition in which the performance of the Welch test
is not good. When one or more extreme standard deviations are paired
with the smaller sample sizes, the empirical Type I error rate of
Welch's test reaches its maximum value, around 8%. Other tests perform
much worse than Welch's test and don't reveal any consistent pattern.
Although the combinations of sample sizes and standard deviations were
adequate for this investigation, the power comparisons were limited by
the use of only one alternative hypothesis. Other alternatives would
have been helpful in this study.
TABLE 12.  Empirical Type I Error Probabilities, Nominal Size = .05.

                       Standard
Sample Size            Deviation                     Brown-
Condition              Condition          ANOVA F    Forsythe    Welch

(11,11,11,11)          (1,1,1,1)           .048       .046       .055
                       (1,2,3,4)           .068       .051       .060
                       (4,1,1,1)           .109       .084       .055

(21,21,21,21)          (1,1,1,1)           .051       .050       .056
                       (1,2,3,4)           .069       .065       .056
                       (4,1,1,1)           .097       .084       .055

(50,50,50,50)          (4,1,1,1)           .088       .084       .044

(4,8,10,12)            (1,1,1,1)           .051       .048       .072
                       (4,3,2,1)           .173       .065       .086
                       (1,1,1,4)           .041       .075       .069
                       (4,1,1,1)           .279       .081       .082

(6,10,16,20)           (1,1,1,1)           .053       .069       .065
                       (4,3,2,1)           .194       .059       .070
                       (1,1,1,4)           .027       .077       .062
                       (4,1,1,1)           .275       .072       .068

(15,15,15,15,15,15)    (1,1,1,1,1,1)       .049       .048       .062
                       (1,1,1,4,4,4)       .080       .071       .064
                       (1,1,1,1,1,4)       .119       .095       .064

(6,10,15,18,21,25)     (1,1,1,1,1,1)       .047       .047       .075
                       (1,1,1,4,4,4)       .029       .080       .068
                       (4,4,4,1,1,1)       .234       .069       .080
                       (1,1,1,1,1,4)       .041       .100       .073
                       (4,1,1,1,1,1)       .309       .091       .078
                       (4,3,3,1,1,4)       .091       .062       .080
TABLE 13.  Empirical Power of the Tests, Nominal Size = .05,
Mean Structure for All Cases (1.2,0,0,0) or (1.2,0,0,0,0,0).

                       Standard
Sample Size            Deviation                     Brown-
Condition              Condition          ANOVA F    Forsythe    Welch

(11,11,11,11)          (1,1,1,1)           .794       .789       .773
                       (4,1,1,1)           .244       .206       .118

(21,21,21,21)          (1,1,1,1)           .983       .983       .981
                       (4,1,1,1)           .372       .348       .180

(50,50,50,50)          (4,1,1,1)           .604       .593       .334

(4,8,10,12)            (1,1,1,1)           .396       .366       .412
                       (4,3,2,1)           .232       .094       .109
                       (1,1,1,4)           .060       .112       .392
                       (4,1,1,1)           .359       .114       .106

(6,10,16,20)           (1,1,1,1)           .592       .564       .570
                       (4,3,2,1)           .282       .111       .107
                       (1,1,1,4)           .050       .144       .545
                       (4,1,1,1)           .381       .132       .108

(15,15,15,15,15,15)    (1,1,1,1,1,1)       .908       .907       .898
                       (1,1,1,4,4,4)       .152       .137       .825
                       (1,1,1,1,1,4)       .334       .275       .883

(6,10,15,18,21,25)     (1,1,1,1,1,1)       .546       .525       .552
                       (1,1,1,4,4,4)       .038       .104       .505
                       (4,4,4,1,1,1)       .307       .107       .113
                       (1,1,1,1,1,4)       .070       .168       .546
                       (4,1,1,1,1,1)       .422       .154       .113
                       (4,3,3,1,1,4)       .154       .108       .113
A. J. Tomarken and R. C. Serlin (1986)
Purpose and Method

Tomarken and Serlin study the size and power of the ANOVA F,
Welch, Brown-Forsythe, Kruskal-Wallis and inverse normal scores tests.
Only the first three tests will be discussed in this investigation.
The size and power studies included three groups and four groups with
sample sizes ranging from six to thirty. For the power study four
alternatives were investigated with four different sample size and
variance combinations: equal variances, equal pairing (equal sample
sizes and increasing variances), direct pairing (increasing sample
sizes and increasing variances), and inverse pairing (increasing
sample sizes and decreasing variances). For each set of criteria
1,000 replications were simulated. Tables of the results from this
investigation were averaged across cases that were similar in design.
The design structure is shown in TABLE 14; only cases with the same
letter were averaged together.

Size 

With  three  groups  of  equal  sample  sizes,  when  the  variances  were 
unequal,  in  three  of  the  four  cases  the  empirical  rejection  rates  of 
the  ANOVA  F  statistic  exceeded  the  robustness  criterion  of  7%  adopted 
by the authors. With four groups of equal sample sizes and unequal
variances the ANOVA F performed more acceptably, only exceeding the
robustness criterion once in the four cases, but the average was
higher than that of Welch's test or the Brown-Forsythe test. Both
Welch's test and the Brown-Forsythe test performed adequately in the
equal sample size cases with Welch's test remaining slightly closer to
the nominal value.

In  the  unequal  sample  size  cases  the  ANOVA  F  showed  marked 
deviations,  being  too  conservative  when  larger  sample  sizes  were 
paired  with  larger  variances  and  too  liberal  when  larger  sample  sizes 
were  paired  with  smaller  variances.   Results  are  shown  in  TABLE  15. 
The Welch test and the Brown-Forsythe test remained within the
authors' robustness limits in both the three and four group cases.
Once again the Welch test remained slightly closer to the nominal
level than the Brown-Forsythe test. As noted in the previous paper,
when  larger  sample  sizes  were  paired  with  smaller  variances  Welch's 
test  had  a  slightly  poorer  performance.   Overall,  Welch's  test  showed 
the  best  control  of  the  Type  I  error  rate. 

Power 

Results  summarizing  the  behavior  of  the  various  tests  when  the 
variances  are  equal  are  shown  in  TABLE  16.   For  all  cases  mean 
structures  were  specified  to  an  estimated  ANOVA  power  of  0.70  and 
nominal level 0.05. Additional mean structures were specified with
estimated ANOVA power of 0.85 and 0.55; these cases are denoted by *
and ** respectively. As expected the ANOVA F had the highest power
but only by about 1-5%. The Brown-Forsythe test had slightly higher
power than Welch's test. It may be surprising that only minimal losses
in power are incurred when the ANOVA F alternatives are used in the
equal variance cases.
The results from the three sample size and variance combinations,
equal  pairing,  direct  pairing  and  inverse  pairing  are  shown  in  TABLE 
17.   Although  for  the  equal  sample  size  cases  the  ANOVA  F  was  the  most 
powerful  when  the  extreme  mean  was  paired  with  the  largest  variance, 
the  Type  I  error  assessments  showed  that  it  was  frequently  too  liberal 
under  these  conditions,  particularly  when  there  were  three  groups.   A 
striking  consistency  existed  across  all  the  cases  studied.   In  each  of 
the  three  sample  size  and  variance  combinations,  the  Welch  test  proved 
to  be  the  most  powerful  procedure  when  means  were  equally  spaced 
apart,  when  extreme  means  were  paired  with  the  smallest  variances  and 
when  two  identical  means  were  situated  midway  between  two  extreme 
means.   Although  the  Welch  test  was  consistently  superior  for  these 
mean  patterns,  its  relative  advantage  varied  somewhat  across 
conditions. Its rejection rates ranged from 5% to 15% higher than the
Brown-Forsythe test when the means were equally spaced apart. When
extreme  means  were  paired  with  the  smallest  variances  Welch's 
superiority  ranged  from  5%  to  35%  and  when  two  identical  means  were 
situated  between  two  extreme  means  the  superiority  ranged  from  9%  to 
21%. 

Although  the  Welch  test  was  unequivocally  the  test  of  choice  for 
three of the four mean structures, the Brown-Forsythe test was
optimal,  though  less  clearly  so,  for  the  mean  structure  where  an 
extreme  mean  was  paired  with  the  largest  variance.   The  superiority  of 
Brown-Forsythe's  test  ranged  from  8%  to  18%.   Even  though  the  ANOVA  F 
had  the  highest  power  its  severe  lack  of  control  of  the  size  renders 
it  an  unreasonable  test. 
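For comparison with Welch's statistic, the Brown-Forsythe statistic can also be computed from group summary statistics. The sketch below assumes the usual form of the Brown and Forsythe (1974) statistic with a Satterthwaite approximation for the denominator degrees of freedom, which is not restated in this section; the function name and the example data are hypothetical.

```python
def brown_forsythe_statistic(sample_sizes, means, variances):
    """Brown-Forsythe F* from summary statistics.

    Assumed form: between-groups sum of squares divided by the sum of
    (1 - n_i/N) * s_i^2, with Satterthwaite denominator degrees of freedom.
    """
    n_total = sum(sample_sizes)
    k = len(sample_sizes)
    # Unweighted-by-variance grand mean, as in the usual ANOVA numerator.
    grand_mean = sum(n * m for n, m in zip(sample_sizes, means)) / n_total
    numerator = sum(n * (m - grand_mean) ** 2 for n, m in zip(sample_sizes, means))
    terms = [(1 - n / n_total) * v for n, v in zip(sample_sizes, variances)]
    denominator = sum(terms)
    statistic = numerator / denominator
    df1 = k - 1
    df2 = 1 / sum((t / denominator) ** 2 / (n - 1)
                  for t, n in zip(terms, sample_sizes))
    return statistic, df1, df2


# Hypothetical summary data: three groups of five, unit sample variances.
bf_stat, bf_df1, bf_df2 = brown_forsythe_statistic([5, 5, 5], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0])
print(round(bf_stat, 3), bf_df1, round(bf_df2, 3))
```

With equal sample sizes and equal sample variances, F* and its degrees of freedom coincide with the ordinary ANOVA F (here 15 on 2 and 12 degrees of freedom). The numerator uses the ordinary grand mean rather than a variance-weighted one, which is one way to see why the two tests respond differently to which mean is paired with which variance.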
Comments
This  study  included  sufficient  combinations  of  sample  sizes  and 
variances  to  see  the  size  and  power  behavior  of  the  tests.   The  only 
variance  condition  that  was  omitted  that  may  have  been  helpful  would 
have  been  the  situation  where  just  one  extreme  variance  existed. 
Although  the  figures  in  the  tables  accurately  described  the  behavior 
of  the  various  tests,  the  way  they  were  tabulated  was  confusing.   Even 
though  there  would  have  been  four  times  as  many  tables  if  listed 
separately  the  authors  could  have  chosen  the  ones  that  were 
outstanding  or  demonstrated  a  point  that  was  being  made.   The  findings 
in  this  paper  were  consistent  with  previous  papers.   New  findings 
included  the  demonstration  that  the  same  behavior  of  the  tests  found 
earlier  could  be  expected  with  medium  and  large  sample  sizes  and 
confirmation that the ANOVA F fails to adequately control the Type I
error rate even when the sample sizes are large.
TABLE 14.  Design of the Monte Carlo Investigation.

Sample Size                          Variance Conditions
Condition        (6,6,6)    (12,4,1)    (6,2,1)    (1,4,12)    (1,2,6)

(20,20,20)          A          B           B
(12,12,12)          A          B           B
(30,20,10)          A          C           C          D           D
(18,12,6)           A          C           C          D           D

Sample Size                          Variance Conditions
Condition        (6,6,6,6)  (12,6,4,1)  (6,3,2,1)  (1,4,6,12)  (1,2,3,6)

(20,20,20,20)       A          B           B
(12,12,12,12)       A          B           B
(30,24,16,10)       A          C           C          D           D
(18,14,10,6)        A          C           C          D           D


TABLE 15.  Empirical Type I Error Probabilities, Nominal Size = .05,
Probabilities Given are Averaged Across Cases With Same Letter.

Sample Size                 Variance                                                                     Brown-
Condition                   Condition                                              ANOVA F    Welch      Forsythe

(All 4 cases)               ($\sigma_1^2 = \sigma_2^2 = \sigma_3^2$)                .053(0)    .050(0)    .051(0)
($n_1 = n_2 = n_3$)         ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2$)                .069(3)    .048(0)    .062(0)
($n_1 > n_2 > n_3$)         ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2$)                .022(0)    .049(0)    .061(0)
($n_1 > n_2 > n_3$)         ($\sigma_1^2 < \sigma_2^2 < \sigma_3^2$)                .167(4)    .057(0)    .064(1)

(All 4 cases)               ($\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2$)   .053(0)    .055(0)    .052(0)
($n_1 = n_2 = n_3 = n_4$)   ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2 > \sigma_4^2$)   .064(1)    .048(0)    .059(0)
($n_1 > n_2 > n_3 > n_4$)   ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2 > \sigma_4^2$)   .025(0)    .050(0)    .059(0)
($n_1 > n_2 > n_3 > n_4$)   ($\sigma_1^2 < \sigma_2^2 < \sigma_3^2 < \sigma_4^2$)   .144(4)    .056(0)    .059(0)

NOTE:  The number in parentheses is the number of times out of the
four cases that the empirical rejection rate was greater than .07.

TABLE 16. Empirical Power of the Tests, Nominal Level = .05,
Equal Variances, Probabilities Given are Averaged Across Cases With
Same Letter.

Sample Size                   Mean                                  Number                            Brown-
Condition                     Structure                             Of Cases   ANOVA F    Welch      Forsythe

(All 4 cases)                 ($\mu_1 > \mu_2 > \mu_3$)                4         .697       .665       .684
($n_1 = n_2 = n_3$)           ($\mu_1 > \mu_2 = \mu_3$)                2         .712       .693       .709
($n_1 > n_2 > n_3$)           ($\mu_1 > \mu_2 = \mu_3$)                2         .700       .662       .674
($n_1 > n_2 > n_3$)           ($\mu_1 = \mu_2 > \mu_3$)                2         .683       .639       .664

(All 4 cases)                 ($\mu_1 > \mu_2 > \mu_3 > \mu_4$)        4         .698       .664       .684
($n_1 = n_2 = n_3 = n_4$)     ($\mu_1 > \mu_2 = \mu_3 = \mu_4$)        2         .704       .672       .702
($n_1 > n_2 > n_3 > n_4$)     ($\mu_1 > \mu_2 = \mu_3 = \mu_4$)        2         .682       .628       .666
($n_1 > n_2 > n_3 > n_4$)     ($\mu_1 = \mu_2 = \mu_3 > \mu_4$)        2         .684       .637       .667
(All 4 cases)                 ($\mu_1 > \mu_2 = \mu_3 > \mu_4$)        4         .699       .660       .685

TABLE 17. Empirical Power of the Tests, Nominal Level = .05,
For All Cases Mean Structures were Specified to an Estimated ANOVA
Power of .700.

Sample Size and Variance Condition
                                       Number                              Brown-
  Mean Structure                       Of Cases     ANOVA F    Welch      Forsythe

Equal Pairing: ($n_1 = n_2 = n_3$), ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2$)
  $\mu_1 > \mu_2 > \mu_3$                 4           .683       .765       .659
  $\mu_1 > \mu_2 = \mu_3$                 4           .646       .493       .628
  ($\mu_1 > \mu_2 = \mu_3$)*             (2)         (.803)     (.645)     (.784)
  $\mu_1 = \mu_2 > \mu_3$                 4           .766       .938       .743
  ($\mu_1 = \mu_2 > \mu_3$)**            (1)         (.554)     (.864)     (.529)

Direct Pairing: ($n_1 > n_2 > n_3$), ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2$)
  $\mu_1 > \mu_2 > \mu_3$                 4           .665       .909       .855
  $\mu_1 > \mu_2 = \mu_3$                 4           .634       .680       .804
  $\mu_1 = \mu_2 > \mu_3$                 4           .760       .992       .940
  ($\mu_1 = \mu_2 > \mu_3$)**            (3)         (.547)     (.945)     (.803)

Inverse Pairing: ($n_1 > n_2 > n_3$), ($\sigma_1^2 < \sigma_2^2 < \sigma_3^2$)
  $\mu_1 > \mu_2 > \mu_3$                 4           .691       .538       .387
  ($\mu_1 > \mu_2 > \mu_3$)*             (2)         (.840)     (.701)     (.514)
  $\mu_1 > \mu_2 = \mu_3$                 4           .767       .722       .421
  $\mu_1 = \mu_2 > \mu_3$                 4           .646       .298       .384
  ($\mu_1 = \mu_2 > \mu_3$)*         [illegible]     (.756)     (.404)     (.514)

*  Additional power assessment with estimated ANOVA power of .850.
** Additional power assessment with estimated ANOVA power of .550.

TABLE 17 cont.

Sample Size and Variance Condition
                                            Number                              Brown-
  Mean Structure                            Of Cases     ANOVA F    Welch      Forsythe

Equal Pairing: ($n_1 = n_2 = n_3 = n_4$), ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2 > \sigma_4^2$)
  $\mu_1 > \mu_2 > \mu_3 > \mu_4$              4           .681       .779       .660
  $\mu_1 > \mu_2 = \mu_3 = \mu_4$              4           .634       .444       .622
  ($\mu_1 > \mu_2 = \mu_3 = \mu_4$)*          (4)         (.756)     (.570)     (.730)
  $\mu_1 = \mu_2 = \mu_3 > \mu_4$              4           .770       .961       .751
  ($\mu_1 = \mu_2 = \mu_3 > \mu_4$)**         (2)         (.584)     (.924)     (.564)
  $\mu_1 > \mu_2 = \mu_3 > \mu_4$              4           .688       .816       .667

Direct Pairing: ($n_1 > n_2 > n_3 > n_4$), ($\sigma_1^2 > \sigma_2^2 > \sigma_3^2 > \sigma_4^2$)
  $\mu_1 > \mu_2 > \mu_3 > \mu_4$              4           .665       .900       .817
  $\mu_1 > \mu_2 = \mu_3 = \mu_4$              4           .613       .567       .756
  $\mu_1 = \mu_2 = \mu_3 > \mu_4$              4           .770       .994       .925
  ($\mu_1 = \mu_2 = \mu_3 > \mu_4$)**         (4)         (.550)     (.964)     (.777)
  $\mu_1 > \mu_2 = \mu_3 > \mu_4$              4           .637       .886       .799

Inverse Pairing: ($n_1 > n_2 > n_3 > n_4$), ($\sigma_1^2 < \sigma_2^2 < \sigma_3^2 < \sigma_4^2$)
  $\mu_1 > \mu_2 > \mu_3 > \mu_4$              4           .712       .615       .467
  $\mu_1 > \mu_2 = \mu_3 = \mu_4$              4           .792       .848       .497
  $\mu_1 = \mu_2 = \mu_3 > \mu_4$              4           .645       .301       .430
  ($\mu_1 = \mu_2 = \mu_3 > \mu_4$)*          (4)         (.754)     (.402)     (.553)
  $\mu_1 > \mu_2 = \mu_3 > \mu_4$              4           .728       .678       .471

*  Additional power assessment with estimated ANOVA power of .850.
** Additional power assessment with estimated ANOVA power of .550.

CONCLUSIONS AND RECOMMENDATIONS


An unexpected finding was the lack of robustness of the ANOVA F
when the variances were unequal. Even when the sample sizes were
equal, the empirical Type I error probabilities could be twice as high
as desired, particularly when there was only one extreme variance. As
seems to be well known, when the sample sizes were unequal the ANOVA F
showed marked deviations from the nominal level, being too
conservative when larger variances were paired with larger sample
sizes and too liberal when larger variances were paired with smaller
sample sizes. All other tests did remarkably well in terms of
controlling the Type I error rate. The test that had the best control
of the Type I error rate was Welch's test. Even at its worst, the
empirical rate rarely reached as high as 8%. This occurred when there
was one extreme variance and when the sample sizes and variances were
inversely paired.
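
As a minimal sketch of how Welch's statistic can be computed, the
following Python function uses the usual textbook form of the test:
weights $w_i = n_i / s_i^2$, a weighted grand mean, and approximate
denominator degrees of freedom. The function names, variable names,
and example data are illustrative assumptions and are not taken from
this report.

```python
def mean_var(sample):
    """Return the sample mean and the unbiased sample variance."""
    n = len(sample)
    m = sum(sample) / n
    return m, sum((x - m) ** 2 for x in sample) / (n - 1)


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's test of equal means.

    Each group needs at least two observations and a nonzero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means, variances = zip(*(mean_var(g) for g in groups))
    weights = [n / v for n, v in zip(sizes, variances)]  # w_i = n_i / s_i^2
    total = sum(weights)
    # Variance-weighted grand mean: noisy groups count for less.
    center = sum(w * m for w, m in zip(weights, means)) / total
    lam = sum((1 - w / total) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    numerator = sum(w * (m - center) ** 2 for w, m in zip(weights, means)) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    return numerator / denominator, k - 1, (k ** 2 - 1) / (3 * lam)


# Made-up data: three groups with unequal sizes and spreads.
groups = [
    [10.1, 9.8, 10.4, 10.0, 9.9, 10.2],
    [11.0, 12.5, 9.5, 13.1, 10.2],
    [8.0, 14.0, 11.5, 6.5],
]
f_stat, df1, df2 = welch_anova(groups)
print(f"Welch F = {f_stat:.3f} on ({df1}, {df2:.2f}) degrees of freedom")
```

The returned F is referred to an F distribution with df1 and df2
degrees of freedom; df2 is generally not an integer. With two groups
the statistic reduces to the square of the unequal-variance t
statistic.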

Although  only  one  paper  considered  the  test  of  Box,  the  paper  did 
raise  some  doubts  about  the  utility  of  the  test.   It  was  not  the  test 
of  choice  even  when  the  sample  sizes  were  equal  and  rarely  had 
superior  power  over  Welch's  test. 

The first order test of James was similar to Welch's test in
terms of power but did not control the Type I error rate as well.
James' second order test was equivalent to Welch's test in terms of
size and power but, because of its complexity in calculation, it was
not considered better than Welch's test.

This leaves the choice of an alternative to the ANOVA F between
the Brown-Forsythe test and the Welch test. Another surprising result
was that, when the variances were equal, these two tests lost only
about 1% to 5% in power compared to the ANOVA F. It cannot be said
unequivocally that one test is consistently superior to the other.
The Brown-Forsythe test has superior power only when extreme means
coincide with the largest variances. In most other cases Welch's test
has superior power. For example, when extreme means are paired with
the smallest variances or when means are equally spaced, Welch's test
has superior power. When two identical means are situated midway
between two extreme means, Welch's test once again has superior
power. Thus for an experimenter wanting to perform an omnibus test,
Welch's test should be used. There is little doubt that in some
situations Welch's test will not be the most powerful, but on these
occasions the loss in power would only be 1% to 25%.
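
For comparison, a minimal sketch of the Brown-Forsythe statistic in
its usual textbook form is given here: the ordinary between-group sum
of squares divided by $\sum_i (1 - n_i/N) s_i^2$, with Satterthwaite-type
denominator degrees of freedom. The function and variable names and
the example data are illustrative assumptions, not taken from this
report.

```python
def mean_var(sample):
    """Return the sample mean and the unbiased sample variance."""
    n = len(sample)
    m = sum(sample) / n
    return m, sum((x - m) ** 2 for x in sample) / (n - 1)


def brown_forsythe(groups):
    """Return (F*, df1, df2) for the Brown-Forsythe test of equal means.

    Each group needs at least two observations; at least one group
    must have a nonzero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    total_n = sum(sizes)
    means, variances = zip(*(mean_var(g) for g in groups))
    grand_mean = sum(sum(g) for g in groups) / total_n  # unweighted by variance
    between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Each group's contribution to the denominator: (1 - n_i/N) * s_i^2.
    parts = [(1 - n / total_n) * v for n, v in zip(sizes, variances)]
    within = sum(parts)
    df2 = 1 / sum((p / within) ** 2 / (n - 1) for p, n in zip(parts, sizes))
    return between / within, k - 1, df2


# Made-up data: three groups with unequal sizes and spreads.
groups = [
    [10.1, 9.8, 10.4, 10.0, 9.9, 10.2],
    [11.0, 12.5, 9.5, 13.1, 10.2],
    [8.0, 14.0, 11.5, 6.5],
]
f_star, df1, df2 = brown_forsythe(groups)
print(f"Brown-Forsythe F* = {f_star:.3f} on ({df1}, {df2:.2f}) degrees of freedom")
```

When all sample sizes are equal, F* coincides numerically with the
ordinary ANOVA F; only the denominator degrees of freedom differ.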

Another approach for selecting an alternative procedure would be
to carefully investigate the experimental data and choose the test
that would provide the most power, either Welch's test or the
Brown-Forsythe test. The researcher could also perform both tests and
accept the alternative hypothesis if either test rejects the null
hypothesis.

Further research possibilities might include the investigation of
the size and power of the alternative tests when the variances are
unequal and the observations also lack normality. Another
extension of this report could be the investigation of higher order
treatment structures when the variances are unequal.

Typical experimental situations where Welch's test would be
particularly beneficial could be experiments involving preservatives
in meat products or experiments involving yields of grains. In
preservative experiments a situation commonly referred to as masking
occurs. Masking is where a difference between two means may not be
detected because of a large variance in a third group. This frequently
occurs because the controls used in the experiment typically have
large variances. An example of masking might be a situation with a
sample size condition of (6,10,16,20), a standard deviation condition
of (1,1,1,4) and a mean structure of (1.2,0,0,0). In this particular
situation the power of Welch's test is 40 percentage points higher
(54% versus 14%) than that of the Brown-Forsythe test. With a sample
size condition of (4,8,10,12) and the same standard deviation
condition and mean structure as above, a gain in power of 28
percentage points (39% versus 11%) is achieved by using Welch's test.
Even when there are six groups, see TABLE 13, the superiority in power
of Welch's test can be as high as 40 percentage points.
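
A rough way to see why the large-variance group hurts the
Brown-Forsythe test more than Welch's test in the first masking
example is to plug the stated population values into the pieces of
each statistic. This is a back-of-the-envelope illustration under that
assumption, not the simulation behind the power figures quoted above;
all variable names are illustrative.

```python
# Population values from the first masking example in the text.
sizes = [6, 10, 16, 20]
sds = [1.0, 1.0, 1.0, 4.0]
means = [1.2, 0.0, 0.0, 0.0]
total_n = sum(sizes)

# Welch: each group is weighted by n_i / sigma_i^2, so the noisy
# fourth group contributes little to the weighted grand mean.
weights = [n / sd ** 2 for n, sd in zip(sizes, sds)]
total_w = sum(weights)
welch_center = sum(w * mu for w, mu in zip(weights, means)) / total_w
welch_signal = sum(w * (mu - welch_center) ** 2 for w, mu in zip(weights, means))
noisy_weight_share = weights[3] / total_w

# Brown-Forsythe: the denominator sums (1 - n_i/N) * sigma_i^2, so
# the noisy fourth group dominates it.
bf_parts = [(1 - n / total_n) * sd ** 2 for n, sd in zip(sizes, sds)]
bf_noise = sum(bf_parts)
noisy_noise_share = bf_parts[3] / bf_noise
grand_mean = sum(n * mu for n, mu in zip(sizes, means)) / total_n
bf_signal = sum(n * (mu - grand_mean) ** 2 for n, mu in zip(sizes, means))

print(f"Welch: weighted signal = {welch_signal:.3f}, "
      f"share of weight held by group 4 = {noisy_weight_share:.1%}")
print(f"Brown-Forsythe: signal/noise = {bf_signal / bf_noise:.3f}, "
      f"share of denominator from group 4 = {noisy_noise_share:.1%}")
```

Under these values the fourth group holds only a small fraction of
Welch's total weight but supplies most of the Brown-Forsythe
denominator, which is one way to read the masking effect described
above.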

For the grain yield experiment a common situation occurs where
low yields have small variances and high yields have large variances.
This is a direct pairing situation. An example of this direct pairing
might be the following: sample size condition
($n_1 > n_2 > n_3 > n_4$), variance condition
($\sigma_1^2 > \sigma_2^2 > \sigma_3^2 > \sigma_4^2$), and mean
structure ($\mu_1 > \mu_2 > \mu_3 > \mu_4$). For this situation the
Welch test had a power of 90% and the Brown-Forsythe test had a power
of 82%. Other examples of this direct pairing are shown in TABLE 17.

These are just a few of the typical situations where an
alternative  to  the  ANOVA  F  such  as  the  Welch  test  could  be  very 
beneficial  to  researchers.   In  general,  whether  the  sample  sizes  are 
equal  or  not,  if  the  variances  are  unequal  the  Welch  test  should  be 
used  because  of  its  excellent  control  of  the  Type  I  error  rate  and 
usually  superior  power. 

ABSTRACT

Alternative  procedures  for  testing  the  equality  of  population 
means  when  the  assumption  of  equal  variances  is  violated  are 
discussed.   A  numerical  example  is  illustrated  for  each  alternative 
procedure.   Five  papers  which  compare  the  performance  of  these 
alternatives  are  reviewed.   The  tests  are  appraised  in  terms  of 
significance  level  and  power.   Whereas  other  procedures  had 
limitations  in  several  contexts,  the  findings  of  this  report  indicate 
the  test  by  Welch  is  the  test  of  choice  because  of  its  excellent 
control  of  the  Type  I  error  rate  and  usually  superior  power. 


